Abstract:

A framework of generalized linear point process models (glppm) much akin to glm for regression is 
developed where the intensity depends upon a linear predictor process through a known function. 
In the general framework the parameter space is a Banach space. Of particular interest is when 
the intensity depends on the history of the point process itself and possibly additional processes 
through a linear filter, and where the filter is parametrized by functions in a Sobolev space. We 
show two main results. First we show that for a special class of models the penalized maximum 
likelihood estimate is in a finite dimensional subspace of the parameter space - if it exists. In 
practice we can find the estimate using a finite dimensional glppm framework. Second, for the 
general class of models we develop a descent algorithm in the Sobolev space. We conclude the 
paper by a discussion of additive model specifications. 



1 Introduction 



Statistics for point process models is by now a vast and mature subject with a range of 
applications. Survival analysis or more generally event history analysis is perhaps the most 
notable area of application of one-dimensional point processes - or in the one-dimensional 
case we could equivalently say counting processes - with a large body of well developed 
theory. An authoritative treatment is Andersen et al. (1993). Other classical references
include Fleming & Harrington (1991) and Karr (1991). The setup for statistical analysis of
event history models is characterized by observing the occurrence of events - or transitions 
between states - for a collection of individuals. The modeling is based on intensities and it 
is paramount to incorporate covariate effects and be able to handle censoring mechanisms. 

Many other important applications of one-dimensional point processes exist such as queueing
and telecommunication systems, Asmussen (2003), insurance mathematics, Mikosch (2004),
earthquakes, Ogata & Katsura (1986), Ogata et al. (2003), neuronal activity, Brillinger
(1992), Paninski (2004), Pillow et al. (2008), and high-frequency financial modeling,
Hautsch (2004), just to mention some.



A major motivation for the present paper comes from yet another application. With the 
sequencing of the human genome and the subsequent sequencing of many other genomes 
the ground has been laid for analyzing and interpreting the blueprints of life. We analyze 
the static genomes that consist of long DNA sequences and try to identify the collection of 
functional elements that are written in this DNA code. We find the protein coding genes
but also a myriad of other important features such as regulatory elements, Maston et al.
(2006). In the analysis of the genomic data a typical question is whether the occurrence
of a given feature or sequence motif is entirely random as opposed to being organized in
some non-random way. The traditional use of point processes in this area is mostly limited
to specifying a null or reference distribution for randomness - a common choice is here
the homogeneous Poisson point process. Deviations of the data from the null distribution
are taken as evidence for the existence of an organizational structure in the data of some
biological significance.

One attempt to go beyond the Poisson process null model and actually model the occurrences
of certain motifs in the DNA-sequences is found in Gusto & Schbath (2005), which
was also an important inspiration for our further work. The linear Hawkes processes, as
used in Gusto & Schbath (2005), and the general class of multivariate, non-linear Hawkes
processes, as treated in Brémaud & Massoulié (1996), were considered in our further
development of models appropriate for genomic organization. We noted a structural similarity
of the models to the generalized linear models, and this has played a role in the
implementation, Carstensen et al. (2010). The similarity, which we are certainly not the first
to observe, is implicitly present in several of the popular models for survival analysis,
such as Cox's regression model and Aalen's additive model, where the intensity is specified
through a fixed function of a linear combination of covariates. A more direct relation to
the log-linear Poisson model is illustrated in Example VI.1.3 in Andersen et al. (1993). See
also Whitehead (1980) and Aitkin et al. (2005). The terminology of a generalized linear
point process model has, furthermore, been used recently for various Hawkes-type models
of spike trains for neurons, Paninski (2004), Pillow et al. (2008), Toyoizumi et al. (2009).



The models considered in Pillow et al. (2008) for multivariate spike trains share many
components with our models of the occurrences of multiple transcription regulatory
elements. In particular, both use basis expansions for estimation of functional components,
which may be combined with regularization in terms of penalized maximum-likelihood
estimation. In Pillow et al. (2008) the basis functions chosen were raised cosines with a
log-time transformation, whereas we used B-splines in Carstensen et al. (2010).



Motivated by the different applications described above we ask if there are theoretical
results supporting any particular choice of basis functions. Or, phrased differently, if we
can understand a particular choice of basis functions as the solution of a more abstractly
formulated problem. Clearly we have the classical result on smoothing splines in mind,
which shows that splines appear as the solution of a particular penalized least squares
problem, Theorem 2.4 in Green & Silverman (1994). To proceed we first develop a formal
and abstract framework of generalized linear point process models (glppm) parametrized
by a Banach space, and then we show two main results for a general class of models that
includes the Hawkes processes as a special case. The first result we show is similar to the
result on smoothing splines, and it states that the penalized maximum-likelihood estimator
for a specific model is found in a finite-dimensional space spanned by an explicit set of basis
functions. For the linear Hawkes process the solution is a spline. The second result is
different. For the general model class considered we do not find an explicit finite-dimensional
basis. Instead we derive an infinite-dimensional gradient, which suggests an iterative
algorithm, and we establish a convergence result for this algorithm. The interpretation
of the algorithm is as a sequence of finite-dimensional subspace approximations.

The purpose of the present paper is to provide the theoretical framework for the com- 
putation of penalized maximum-likelihood estimators for functional parameters in a one- 
dimensional point process setup. For a treatment of properties of penalized maximum- 
likelihood estimators we refer to Cox & O'Sullivan (1990). The focus is here on the repre- 
sentation and computation. 



2 Setup 

We consider a filtered probability space - a stochastic basis - $(\Omega, \mathcal{F}_t, P)$ where the
filtration is assumed to be right continuous. We will, in addition, assume that $(N_t)_{t \geq 0}$ is an
adapted counting process, which, under $P$, is a homogeneous Poisson process with rate 1.

If $(\lambda_t)_{t \geq 0}$ is a positive, predictable process we can define the positive process, known as
the likelihood process, by

$$\mathcal{L}_t = \exp\left( t + \int_0^t \log \lambda_s \, N(ds) - \Lambda_t \right), \qquad \Lambda_t = \int_0^t \lambda_s \, ds. \qquad (1)$$



We will assume that $\Lambda_t < \infty$ $P$-a.s., in which case $(\mathcal{L}_t)_{t \geq 0}$ in general is a $P$-local martingale
and a $P$-supermartingale with $E_P(\mathcal{L}_t) \leq 1$ for all $t \geq 0$, Theorem VI.T2, Brémaud (1981).
If $E_P(\mathcal{L}_t) = 1$ we can define a probability measure $Q_t$ on $\mathcal{F}_t$ by taking $\mathcal{L}_t$ to be the
Radon-Nikodym derivative of $Q_t$ w.r.t. $P$. That is,

$$Q_t = \mathcal{L}_t \cdot P. \qquad (2)$$

We note that $E_P(\mathcal{L}_t) = 1$ if and only if $(\mathcal{L}_s)_{0 \leq s \leq t}$ is a true $P$-martingale. If $E_P(\mathcal{L}_t) < 1$ we
cannot define a probability measure $Q_t$ on the abstract space $(\Omega, \mathcal{F}_t)$ by (2). With a more
explicit, canonical choice of $\Omega$ it is always possible to construct a measure $Q_t$ such that

$$Q_t = \mathcal{L}_t \cdot P + Q_t^{\perp}$$

where $Q_t^{\perp}(N_t < \infty) = 0$, see Jacod (1975) or Theorem 5.2.1(ii), Jacobsen (2006).

Throughout we will fix an observation window $[0,t]$. The process $(\lambda_s)_{0 \leq s \leq t}$ is called the
(predictable) intensity process for the counting process $(N_s)_{0 \leq s \leq t}$ under $Q_t$. The integrated
intensity, $(\Lambda_s)_{0 \leq s \leq t}$, is the compensator, and if $E_P(\mathcal{L}_t) = 1$ the process $M_s = N_s - \Lambda_s$ for
$s \in [0,t]$ is a $Q_t$-martingale, Theorem VI.T3, Brémaud (1981).



From a model building perspective the direct specification of the intensity process is natural
as well as practical. Practical because the construction of probability models for a
parametrized family of intensity processes, $(\lambda_t(\beta))_{t \geq 0}$, for $\beta \in \Theta$, through the likelihood
process construction above immediately yields the likelihood function $\mathcal{L}_t(\beta)$ for subsequent
statistical inference. There is one small caveat though. For the specification of the
probability models to lead to a statistical model dominated by $P$ all the likelihood processes
need to be true $P$-martingales, at least on $[0,t]$, which is equivalent to $E_P(\mathcal{L}_t(\beta)) = 1$ for
all $\beta \in \Theta$. This is a technical obstacle, and it does not seem to be easy to formulate a
simple, general criterion. The problem is equivalent to checking whether the intensities specify
non-exploding point processes on canonical spaces. When there is positive probability of
explosion for some measures, and the model is thus not dominated by $P$, it is anyway
sensible to compare two models, $Q_t^{\beta}$ and $Q_t^{\beta'}$, in terms of the Radon-Nikodym derivatives

$$\frac{dQ_t^{\beta}}{d(Q_t^{\beta} + Q_t^{\beta'})} \quad \text{and} \quad \frac{dQ_t^{\beta'}}{d(Q_t^{\beta} + Q_t^{\beta'})},$$

see page 893, Kiefer & Wolfowitz (1956). If we have non-exploding data on $[0,t]$, this
comparison is equivalent to comparing $\mathcal{L}_t(\beta)$ with $\mathcal{L}_t(\beta')$, and though $\mathcal{L}_t(\beta)$ is not
necessarily a true likelihood function it provides a sensible relative measure of the models
parametrized by $\beta$ - but only for non-exploding data. It seems that we do not need to
check if $E_P(\mathcal{L}_t(\beta)) = 1$, but this is a delusion. If the process we observe is not
exploding on $[0,t]$ - and there may often be subject matter reasons it is not - all models with
$E_P(\mathcal{L}_t(\beta)) < 1$ are misspecified in a fundamental way. Arguably, the models specified by

$$\tilde{Q}_t^{\beta} = \tilde{\mathcal{L}}_t(\beta) \cdot P \quad \text{with} \quad \tilde{\mathcal{L}}_t(\beta) = \frac{\mathcal{L}_t(\beta)}{E_P(\mathcal{L}_t(\beta))}$$

are more appropriate, which is equivalent to conditioning on non-explosion. However,
$(\lambda_s(\beta))_{0 \leq s \leq t}$ is no longer the intensity process under $\tilde{Q}_t^{\beta}$ and the likelihood, $\tilde{\mathcal{L}}_t(\beta)$, is only
known up to a normalizing constant, which in general is complicated to compute. We will
not pursue this direction any further.

We proceed with the general setup and let V denote a separable Banach space with the 
norm || • ||, and V* is its dual space of continuous linear functionals equipped with the 
dual norm. The dual norm is also denoted || • ||, which turns V* into a Banach space as 
well. Due to separability of V the dual space V* is separable and second countable in 



the weak*-topology (see e.g. Exercise E.2.5.3, Pedersen (1989)). We equip $V^*$ with the
weak* Borel $\sigma$-algebra, which then coincides with the $\sigma$-algebra generated by the linear
functionals

$$x \mapsto x\beta$$

for $\beta \in V$. We then consider an adapted, norm-càdlàg stochastic process $(X_s)_{0 \leq s \leq t}$ with
values in $V^*$. That is, $X_s$ is an $\mathcal{F}_s$-measurable random variable with values in $V^*$ and we
assume that the sample path of the process is càdlàg in the norm topology so that for all
$\omega$ it holds that

$$\lim_{\epsilon \to 0+} \|X_{s+\epsilon}(\omega) - X_s(\omega)\| = 0,$$

and there is an $X_{s-}(\omega) \in V^*$ such that

$$\lim_{\epsilon \to 0+} \|X_{s-\epsilon}(\omega) - X_{s-}(\omega)\| = 0.$$

For any $\beta \in V$ the real valued process $(X_s\beta)_{0 \leq s \leq t}$ is then adapted and càdlàg, and
$(X_{s-}\beta)_{0 \leq s \leq t}$ is predictable, cf. Proposition 2.6 in Jacod & Shiryaev (2003). We call
$(X_{s-}\beta)_{0 \leq s \leq t}$ the linear predictor process. If $D \subseteq \mathbb{R}$ we introduce the set

$$\Theta(D) = \{\beta \in V \mid X_{s-}\beta \in D \text{ for all } s \in [0,t] \ P\text{-a.s.}\}.$$

Definition 2.1. Assume that $\varphi : D \to [0,\infty)$ and assume in addition that $(Y_s)_{0 \leq s \leq t}$
is a predictable, càdlàg process with values in $[0,\infty)$. We define a generalized linear point
process model on $[0,t]$ to be the statistical model for a point process on $[0,t]$ with parameter
space $\Theta(D)$ such that for $\beta \in \Theta(D)$ the point process has intensity

$$\lambda_s = Y_s \varphi(X_{s-}\beta)$$

for $s \in [0,t]$.

For $\beta \in \Theta(D)$ we have the likelihood process $\mathcal{L}_t(\beta)$ given by (1) in terms of the intensity
defined above. For the general definition we do not require it to be a martingale, and whether
it is plays no role for the results and computations in the present paper. However, for
interpretations and to obtain sensible models via penalized maximum-likelihood estimation we
certainly need to be able to verify if the process is a martingale. We discuss some possibilities
below.

The $Y$-process in the definition serves the same purpose as in survival analysis, that is,
it can be a simple at risk indicator process, but we keep it in the definition as a general,
predictable, non-negative process. Note that if $\varphi$ is one-to-one with inverse
$m = \varphi^{-1} : \varphi(D) \to D$ then in the absence of the $Y$-process we have

$$X_{s-}\beta = m(\lambda_s).$$

Drawing an analogy to ordinary generalized linear models it seems natural at this point
to call $m$ the link function - it transforms the intensity process into a process that is
linear in the parameter $\beta$. With this terminology we would call $\varphi$ the inverse link function.
However, in general there is no reason to require $\varphi$ to be one-to-one, and we will not use
the terminology.

Whether the intensity in Definition 2.1 gives rise to a likelihood process, which is a true
martingale, can depend quite heavily on the choice of $\varphi$. If $\varphi$ is bounded the martingale
condition is easy to verify, cf. Theorem VI.T4 in Brémaud (1981). If, on the other
hand, $(X_s)_{0 \leq s \leq t}$ is independent of $(N_s)_{0 \leq s \leq t}$ under $P$ and $(Y_s)_{0 \leq s \leq t}$ is bounded, say, then
$E_P(\mathcal{L}_t) = 1$ regardless of the choice of $\varphi$. To give one additional criterion we assume for
simplicity that $Y_s = 1$ and $\Omega$ is the canonical space of counting processes. Then we can,
as mentioned above, construct a measure $Q$ such that $\lambda_s = \varphi(X_{s-}\beta)$ is the intensity for
the counting process under $Q$. If $|X_{s-}\beta| \leq C N_{s-} + D$ and $\varphi(x) \leq c|x| + d$ it follows that

$$\lambda_s \leq \alpha N_{s-} + \gamma.$$

According to Example 4.4.5 in Jacobsen (2006) the counting process is not exploding under
$Q$ and by Theorem 5.2.1(ii) in Jacobsen (2006) the likelihood process is a true martingale.
A more refined treatment for the class of non-linear Hawkes processes focusing on stability
in the sense of (asymptotic) stationarity is found in Brémaud & Massoulié (1996).

When the likelihood process is a martingale it is evident from (1) that as a statistical
model with parameter space $\Theta(D) \subseteq V$ the minus-log-likelihood function for observing
$(N_s)_{0 \leq s \leq t}$ is

$$l_t(\beta) = \int_0^t Y_s \varphi(X_{s-}\beta) \, ds - \int_0^t \log\left(Y_s \varphi(X_{s-}\beta)\right) N(ds) \qquad (3)$$

for $\beta \in \Theta(D)$. Note that if $\Delta N_s = 1$ but $Y_s = 0$ then $l_t(\beta) = \infty$, which simply tells us
that the model as formulated is inappropriate for the data. In the following we therefore
assume that this is not the case, that is, $Y_s > 0$ for all $s$ with $\Delta N_s = 1$.

For practical applications - even when $V$ is finite dimensional - the maximum-likelihood
estimator may not be well defined. One solution is to introduce a penalty function
$J : \Theta(D) \to \mathbb{R}$ and then to minimize the function

$$l_t(\beta) + J(\beta)$$

instead. We provide examples below.

The minus-log-likelihood function is simple as a function of $\beta$ and if $\varphi$ is convex and
log-concave we see that $l_t$ is convex as well. The penalty function is typically also chosen to
be convex.
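
As a rough numerical illustration of (3), the following Python sketch evaluates a discretized version of the penalized minus-log-likelihood when $V = \mathbb{R}^p$. The Riemann-sum approximation on a regular grid, the function name `penalized_neg_loglik`, the ridge penalty $\lambda|\beta|^2$, the constant $Y$-process and the toy data are all assumptions made for illustration and are not taken from the paper.

```python
import math

def penalized_neg_loglik(beta, grid_x, dt, event_x, phi=math.exp, lam=0.0, y=1.0):
    """Riemann-sum approximation of l_t(beta) + lam * |beta|^2.

    grid_x  : rows X_{s-} evaluated on a regular grid with spacing dt (assumed)
    event_x : rows X_{tau_i-} evaluated at the observed jump times
    y       : constant value of the Y-process (assumed constant here)
    """
    dot = lambda x: sum(xi * bi for xi, bi in zip(x, beta))
    # First term of (3): integral of Y_s * phi(X_{s-} beta) ds.
    integral = sum(y * phi(dot(x)) for x in grid_x) * dt
    # Second term of (3): sum of log intensities at the jump times.
    log_term = sum(math.log(y * phi(dot(x))) for x in event_x)
    penalty = lam * sum(b * b for b in beta)
    return integral - log_term + penalty

# Hypothetical toy data: intercept plus one covariate on [0, 1], two events.
dt = 0.01
grid_x = [(1.0, k * dt) for k in range(100)]
event_x = [(1.0, 0.25), (1.0, 0.75)]

value_at_zero = penalized_neg_loglik((0.0, 0.0), grid_x, dt, event_x)
```

With $\varphi = \exp$, which is convex and log-concave, the discretized objective is convex in $\beta$, in line with the remark above.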

The generalized linear models for point processes have value even when V is finite dimen- 
sional, but we emphasize that models of considerably greater generality fit into the model 
class above for a suitable choice of X-process. This is at least true from a practical point 
of view where finite basis expansions can be used to approximate non-parametric
components, and we also show one result in Section 3 where penalized maximum-likelihood
estimation in an infinite dimensional function space reduces to penalized maximum likeli- 
hood estimation for a generalized linear model with a finite dimensional parameter space. 
Here we give a simple but well known example of how Cox's regression model fits into the 
framework. 

Example 2.2. The Cox proportional hazards model can be (re)formulated as a generalized
linear point process model. We take $X_s \in \mathbb{R}^d$ to be an adapted, $d$-dimensional càdlàg
process, which is independent of $(N_s)_{0 \leq s \leq t}$ under $P$. Then $X_s^T$ is a process in $(\mathbb{R}^d)^*$ and
the Cox model is specified by the intensity

$$\lambda_s = \exp(X_{s-}^T \beta) \alpha(s)$$

where $\alpha(s)$ is the baseline intensity and $\beta \in \mathbb{R}^d$. If

$$\log \alpha(s) = B_{s-} \beta_\alpha,$$

where $B_s \in V^*$ is a known, adapted, norm-càdlàg process with values in the dual of $V$ and
$\beta_\alpha \in V$, we can rewrite the intensity as

$$\lambda_s = \exp\left( \begin{pmatrix} X_{s-}^T & B_{s-} \end{pmatrix} \begin{pmatrix} \beta \\ \beta_\alpha \end{pmatrix} \right) = \exp(X_{s-}^T \beta + B_{s-} \beta_\alpha) = \exp(X_{s-}^T \beta) \exp(B_{s-} \beta_\alpha),$$

which is a generalized linear point process model with $\varphi = \exp$, with domain $D = \mathbb{R}$, and
with parameter space $\mathbb{R}^d \times V$.

It is hardly conceivable that we can estimate the parameters in a sensible way for a single
observation of the counting process, and in practice we will use the model with independent
replications and corresponding intensities, $\lambda^1, \ldots, \lambda^n$, possibly even multiplied by at
risk indicator processes. Even so, it may still be desirable to penalize the $\beta_\alpha$ parameter to
obtain a smooth fit of $\alpha$, and a possible choice of penalty function is

$$J(\beta) = \lambda \|\beta_\alpha\|^2$$

for $\lambda > 0$. In practice we may have $V = \mathbb{R}^{d'}$ and $B_t = (B_{t,1}, \ldots, B_{t,d'})$ where $B_{t,1}, \ldots, B_{t,d'}$
are known (deterministic) basis functions. If the basis functions are $C^2$ in $t$ a natural norm
on $\mathbb{R}^{d'}$ is given by the quadratic form $K$ where

$$K_{i,j} = \int_0^t B''_{s,i} B''_{s,j} \, ds.$$

In this case,

$$J(\beta) = \lambda \beta_\alpha^T K \beta_\alpha = \lambda \int_0^t \left( [\log \alpha(s)]'' \right)^2 ds,$$

which is a popular choice of penalty term that we will consider below.

Before turning to more concrete models we make one general observation about the
derivatives of the minus-log-likelihood under the assumption that $\varphi$ is differentiable.

Proposition 2.3. If $D \subseteq \mathbb{R}$ is open and if $\varphi$ is $C^1$ on $D$ then $l_t$ is Gateaux differentiable
in $\beta \in \Theta(D)^\circ$ if $l_t(\beta) < \infty$ with derivative

$$Dl_t(\beta) = \int_0^t Y_s \varphi'(X_{s-}\beta) X_{s-} \, ds - \int_0^t \frac{\varphi'(X_{s-}\beta)}{\varphi(X_{s-}\beta)} X_{s-} \, N(ds). \qquad (4)$$

Moreover, if $\varphi$ is $C^2$ the second Gateaux derivative is

$$D^2 l_t(\beta) = \int_0^t Y_s \varphi''(X_{s-}\beta) X_{s-} \otimes X_{s-} \, ds - \int_0^t \frac{\varphi''(X_{s-}\beta)\varphi(X_{s-}\beta) - \varphi'(X_{s-}\beta)^2}{\varphi(X_{s-}\beta)^2} X_{s-} \otimes X_{s-} \, N(ds). \qquad (5)$$

If $J$ is Gateaux differentiable the penalized maximum likelihood estimator in $\Theta(D)^\circ$, if it
exists, is then a solution to the equation $Dl_t(\beta) + DJ(\beta) = 0$.
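
The derivative (4) can be checked numerically in a finite-dimensional, discretized setting. The sketch below is only an illustration under stated assumptions: $V = \mathbb{R}^p$, $Y_s = 1$, integrals replaced by midpoint sums on a regular grid, and a softplus choice of $\varphi$; the function names and the toy data are hypothetical.

```python
import math

def neg_loglik(beta, grid_x, dt, event_x, phi):
    # Discretized l_t(beta) with Y = 1 (assumption).
    dot = lambda x: sum(a * b for a, b in zip(x, beta))
    return (sum(phi(dot(x)) for x in grid_x) * dt
            - sum(math.log(phi(dot(x))) for x in event_x))

def gradient(beta, grid_x, dt, event_x, phi, dphi):
    # Discretized version of (4) with Y = 1.
    dot = lambda x: sum(a * b for a, b in zip(x, beta))
    grad = [0.0] * len(beta)
    for x in grid_x:                      # integral of phi'(X beta) X ds
        w = dphi(dot(x)) * dt
        for k, xk in enumerate(x):
            grad[k] += w * xk
    for x in event_x:                     # sum of (phi'/phi)(X beta) X over jumps
        w = dphi(dot(x)) / phi(dot(x))
        for k, xk in enumerate(x):
            grad[k] -= w * xk
    return grad

def finite_difference(beta, grid_x, dt, event_x, phi, eps=1e-6):
    # Central differences, used only to sanity-check the analytic gradient.
    out = []
    for k in range(len(beta)):
        up = list(beta); up[k] += eps
        dn = list(beta); dn[k] -= eps
        diff = neg_loglik(up, grid_x, dt, event_x, phi) - neg_loglik(dn, grid_x, dt, event_x, phi)
        out.append(diff / (2 * eps))
    return out

softplus = lambda x: math.log1p(math.exp(x))      # strictly positive, smooth
dsoftplus = lambda x: 1.0 / (1.0 + math.exp(-x))  # its derivative

dt = 0.01
grid_x = [(1.0, (k + 0.5) * dt) for k in range(100)]
event_x = [(1.0, 0.25), (1.0, 0.75)]
beta = [0.3, -0.7]
analytic = gradient(beta, grid_x, dt, event_x, softplus, dsoftplus)
numeric = finite_difference(beta, grid_x, dt, event_x, softplus)
```

The two vectors `analytic` and `numeric` should agree up to the finite-difference error.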

One way to interpret the process $(X_{s-}\beta)_{0 \leq s \leq t}$ is as a predictable, linear filter of the Banach
space valued process $(X_s)_{0 \leq s \leq t}$. The possible linear filters are parametrized by $\beta \in \Theta(D)$,
and the objective from a statistical point of view is the estimation of $\beta$.

In Section 3 below we restrict our attention to stochastic processes with values in a
reproducing kernel Hilbert space (RKHS), which are given through stochastic integration w.r.t.
an ordinary real valued stochastic process. In Section 4 we generalize the class of models
to an additive model framework, where the parameter space is a product of RKHSs.
The product space can be equipped with an inner product that turns it into a Hilbert
space, but it can also be equipped with a 1-norm, which turns the product space into a
Banach space. In the latter case we discuss how the natural penalization leads to an infinite
dimensional version of a lasso estimator.



3 Linear filters from stochastic integration 

Let $g : [0,\infty) \to \mathbb{R}$ be a measurable function and $(Z_s)_{0 \leq s \leq t}$ a càdlàg semi-martingale. If $g$
is e.g. locally bounded (which is sufficient for our purposes) the stochastic process

$$\int_0^s g(s-u) \, dZ_u$$

is a well defined càdlàg process. The process is sometimes called a homogeneous linear
filter or a moving average.

The parameter space we will consider is $V = W^{m,2}([0,t])$, that is, $V$ is the Sobolev space
of functions that are $m$ times weakly differentiable with the $m$'th derivative in $L_2([0,t])$.
For this concrete parameter space we will use $g$ for the generic parameter - in contrast to
the abstract notation where we use $\beta$.

We will need to interpret the stochastic integral above as a stochastic process with values
in $V^*$. Since the stochastic integral is not defined pathwise in general, it is in fact not
obvious that

$$g \mapsto X_s g := \int_0^s g(s-u) \, dZ_u$$

for a fixed sample path is even a well defined linear functional - let alone continuous.
For the pathwise definition of the stochastic integral as a linear functional we note that
functions in $W^{m,2}([0,t])$ for $m \geq 1$ are weakly differentiable with $L_2$-derivatives. Hence by
integration by parts, see e.g. Definition 4.45 and Proposition 4.49(b) in Jacod & Shiryaev
(2003), we have that

$$\int_0^s h(u) \, dZ_u = h(s) Z_s - h(0) Z_0 - \int_0^s Z_{u-} h'(u) \, du \qquad (6)$$

for $h \in W^{m,2}([0,t])$. This equality is in general valid up to evanescence. The right hand side
is pathwise well defined, thus we simply use the version of the stochastic integral defined
by the right hand side above. The integral then obviously becomes a linear functional in $h$
for a concrete realization of the $Z$-process. Combined with Corollary A.2 this shows that
we can regard $(X_s)_{0 \leq s \leq t}$ as a stochastic process with values in $V^*$. Lemma A.3 shows,
moreover, that $(X_s)_{0 \leq s \leq t}$ is norm-càdlàg with $X_{s-}$ obviously given as

$$X_{s-} g = \int_0^{s-} g(s-u) \, dZ_u.$$
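
For a counting process $Z$ with $Z_0 = 0$ both sides of (6) can be computed pathwise, which gives a simple numerical check of the identity. The sketch below assumes a hypothetical realization with three jump times and a smooth test function $h$; the midpoint rule used for the Lebesgue integral and all names are illustrative choices, not part of the paper.

```python
def stieltjes_sum(h, jumps, s):
    # Left hand side of (6): for a counting process the integral is a sum over jumps.
    return sum(h(u) for u in jumps if u <= s)

def by_parts(h, dh, jumps, s, n=20000):
    # Right hand side of (6): h(s) Z_s - h(0) Z_0 - int_0^s Z_{u-} h'(u) du,
    # where Z_0 = 0 so the middle term vanishes.
    z_left = lambda u: sum(1 for v in jumps if v < u)   # Z_{u-}
    z_s = sum(1 for v in jumps if v <= s)               # Z_s
    du = s / n
    integral = sum(z_left((k + 0.5) * du) * dh((k + 0.5) * du) for k in range(n)) * du
    return h(s) * z_s - integral

jumps = [0.2, 0.5, 0.9]          # hypothetical jump times of Z
h = lambda u: u * u + 1.0        # smooth test function
dh = lambda u: 2.0 * u           # its derivative

lhs = stieltjes_sum(h, jumps, 1.0)
rhs = by_parts(h, dh, jumps, 1.0)
```

Both `lhs` and `rhs` should be close to $h(0.2) + h(0.5) + h(0.9) = 4.1$.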



If the function $\varphi : D \to [0,\infty)$ is given we find that $\Theta(D)$ consists of those $g$ such that

$$\int_0^{s-} g(s-u) \, dZ_u \in D \quad \text{for all } s \in [0,t] \ P\text{-a.s.} \qquad (7)$$

The Sobolev space $W^{m,2}([0,t])$ can be equipped with several inner products that give rise
to equivalent norms and turn the space into a RKHS, Wahba (1990), Berlinet & Thomas-Agnan
(2004). For each inner product there is an associated kernel, the reproducing kernel,
and we assume here that one inner product is chosen with the corresponding norm denoted
$\| \cdot \|$ and corresponding kernel denoted $R : [0,t] \times [0,t] \to \mathbb{R}$. Moreover, we fix
$\phi_1, \ldots, \phi_l \in W^{m,2}([0,t])$ and denote by $P$ the orthogonal projection onto
$\mathrm{span}\{\phi_1, \ldots, \phi_l\}^{\perp}$. One of the defining properties of the kernel $R$ is that for fixed
$s \in [0,t]$, $R(s,\cdot) \in W^{m,2}([0,t])$, hence $PR(s,\cdot)$ is a well defined function. This gives rise to
the projected kernel, which we denote $R^1 = PR$. With this setup the penalty function we choose
is $J(g) = \lambda \|Pg\|^2$ for $\lambda > 0$, and the penalized minus-log-likelihood function reads

$$l_t(g) + \lambda \|Pg\|^2 \qquad (8)$$

for $g \in \Theta(D)$ where

$$l_t(g) = \int_0^t Y_s \varphi\left( \int_0^{s-} g(s-u) \, dZ_u \right) ds - \int_0^t \log\left( Y_s \varphi\left( \int_0^{s-} g(s-u) \, dZ_u \right) \right) N(ds).$$



With $\tau_1, \ldots, \tau_{N_t}$ denoting the jump times for the counting process $(N_s)_{0 \leq s \leq t}$ we can state
one of the main theorems.

Theorem 3.1. If $\varphi(x) = x + d$ with domain $D = [-d,\infty)$ then a minimizer of (8) over
$\Theta(D) \subseteq W^{m,2}([0,t])$, $m \geq 1$, belongs to the finite dimensional subspace of $W^{m,2}([0,t])$
spanned by the functions $\phi_1, \ldots, \phi_l$, the functions

$$h_i(r) = \int_0^{\tau_i-} R^1(\tau_i - u, r) \, dZ_u$$

for $i = 1, \ldots, N_t$ together with the function

$$f(r) = \int_0^t Y_s \int_0^{s-} R^1(s-u, r) \, dZ_u \, ds.$$

Remark 3.2. A practical consequence of Theorem 3.1 is that by collecting $\phi_1(r), \ldots, \phi_l(r)$,
$f(r)$ and $h_i(r)$, $i = 1, \ldots, N_t$, in an $l + 1 + N_t$ dimensional vector we reduce the estimation
problem to a finite dimensional optimization problem. For the concrete realization we may
of course choose whichever basis is most convenient for this function space. For the
practical computation of $f$ we note that by Lemma A.5 we can interchange the order of
the integrations so that

$$f(r) = \int_0^t \int_u^t Y_s R^1(s-u, r) \, ds \, dZ_u. \qquad (9)$$

Remark 3.3. It is a common trick to construct a model conditionally on the entire
outcome of a process $(Z_s)_{0 \leq s \leq t}$ by assuring that $Z_s$ is $\mathcal{F}_0$-measurable for all $s \in [0,t]$. In
this case the process

$$\int_0^t g(|s-u|) \, dZ_u$$

for $s \in [0,t]$ is obviously predictable. Theorem 3.1 still holds with the modification that

$$h_i(r) = \int_0^t R^1(|\tau_i - u|, r) \, dZ_u$$

for $i = 1, \ldots, N_t$ and

$$f(r) = \int_0^t Y_s \int_0^t R^1(|s-u|, r) \, dZ_u \, ds.$$

When we model events that happen in time it is most natural that the intensity at a
given time $t$ only depends on the behavior of the $Z$-process up to just before $t$. This
corresponds to the formulation chosen in Theorem 3.1. However, if we model events in a
one-dimensional space it is often more natural to take the approach in this remark.

One useful choice of inner product on $W^{m,2}([0,t])$ is given as follows. Take

$$\mathcal{H}_1 = \{ f \in W^{m,2}([0,t]) \mid f(0) = Df(0) = \ldots = D^{m-1}f(0) = 0 \},$$

which we equip with the inner product

$$\langle f, g \rangle = \int_0^t D^m f(s) D^m g(s) \, ds.$$

This turns $\mathcal{H}_1$ into a reproducing kernel Hilbert space for $m \geq 1$ with reproducing kernel
$R^1 : [0,t] \times [0,t] \to \mathbb{R}$ given as

$$R^1(s,r) = \int_0^{s \wedge r} \frac{(s-u)^{m-1} (r-u)^{m-1}}{((m-1)!)^2} \, du,$$

see Wahba (1990). Furthermore, define $\phi_k(t) = t^{k-1}/(k-1)!$ for $k = 1, \ldots, m$ and

$$\mathcal{H}_0 = \mathrm{span}\{\phi_1, \ldots, \phi_m\},$$

which we equip with the inner product

$$\left\langle \sum_i a_i \phi_i, \sum_j b_j \phi_j \right\rangle = \sum_i a_i b_i,$$

so that $\phi_1, \ldots, \phi_m$ is an orthonormal basis for $\mathcal{H}_0$. Then $\mathcal{H}_0$ is also a reproducing kernel
Hilbert space with reproducing kernel $R^0 : [0,t] \times [0,t] \to \mathbb{R}$ defined by

$$R^0(s,r) = \sum_{k=1}^m \phi_k(s) \phi_k(r).$$

Then the Sobolev space $W^{m,2}([0,t]) = \mathcal{H}_0 \oplus \mathcal{H}_1$ is a reproducing kernel Hilbert space with
reproducing kernel $R(s,r) = R^0(s,r) + R^1(s,r)$, $\mathcal{H}_0 \perp \mathcal{H}_1$, and with $P$ the orthogonal
projection onto $\mathcal{H}_1$, $PR = R^1$ and

$$J(g) = \lambda \int_0^t (D^m g(s))^2 \, ds.$$

It follows by the definition of $R^1$ that $R^1(s,\cdot)$ for fixed $s$ is a piecewise polynomial of degree
$2m-1$ with continuous derivatives of order $2m-2$, that is, $R^1(s,\cdot)$ is an order $2m$ spline.
We find that e.g. the $h_i$-functions for the basis in Theorem 3.1 are given as stochastic
integrals of order $2m$ splines.

Example 3.4. If $(Z_s)_{0 \leq s \leq t}$ itself is a counting process and $\varphi(x) = x + d$ as in Theorem
3.1 we can give a more detailed description of the minimizer of (8) over $\Theta(D)$. We will
also assume that the $Y$-process is identically 1. If $\sigma_1, \ldots, \sigma_{Z_t}$ denote the jump times for
$(Z_s)_{0 \leq s \leq t}$ we find that

$$h_i(r) = \sum_{j : \sigma_j < \tau_i} R^1(\tau_i - \sigma_j, r).$$

Collectively, the $h_i$ basis functions are order $2m$ splines with knots in

$$\{ \tau_i - \sigma_j \mid i = 1, \ldots, N_t, \ j : \sigma_j < \tau_i \}.$$

Due to (9) the last basis function, $f$, is seen to be an order $2m+1$ spline with knots in

$$\{ t - \sigma_j \mid j = 1, \ldots, Z_t \}.$$

The cubic splines, $m = 2$, are the splines mostly used in practice. Here

$$R^1(s,r) = \int_0^{s \wedge r} (s-u)(r-u) \, du = sr(s \wedge r) - \frac{(s+r)(s \wedge r)^2}{2} + \frac{(s \wedge r)^3}{3}$$

and we can compute the integrated functions that enter in $f$ as follows. If $t - u < r$

$$\int_u^t R^1(s-u, r) \, ds = \int_0^{t-u} R^1(s,r) \, ds = \frac{r(t-u)^3}{6} - \frac{(t-u)^4}{24}$$

and if $t - u \geq r$

$$\int_u^t R^1(s-u, r) \, ds = \int_0^{t-u} R^1(s,r) \, ds = \frac{3r^4}{24} + \int_r^{t-u} R^1(s,r) \, ds = \frac{r^4}{24} + \frac{r^2(t-u)^2}{4} - \frac{r^3(t-u)}{6}.$$

Thus the function $f$ is a sum of functions, the $j$'th function being a degree 4 polynomial
on $[0, t - \sigma_j]$ and an affine function on $(t - \sigma_j, t]$.

If $Z_s = N_s$ the process $(N_s)_{0 \leq s \leq t}$ is under $Q_t$ known as a linear Hawkes process, in which
case the set of knots for the $h_i$-functions equals the collection of interdistances between
the points.
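
The closed-form expressions in Example 3.4 for $m = 2$ are easy to evaluate and to check against numerical integration. The following sketch is a minimal illustration; the function names, the midpoint quadrature and the jump times used in the usage lines are hypothetical choices and not part of the paper.

```python
def r1(s, r):
    # Closed form of int_0^{min(s, r)} (s - u)(r - u) du, the kernel R^1 for m = 2.
    a = min(s, r)
    return s * r * a - (s + r) * a ** 2 / 2 + a ** 3 / 3

def h_basis(tau_i, sigmas, r):
    # h_i(r) = sum over jumps sigma_j < tau_i of R^1(tau_i - sigma_j, r).
    return sum(r1(tau_i - sig, r) for sig in sigmas if sig < tau_i)

def integrated_r1(a, r):
    # int_0^a R^1(s, r) ds with a = t - u, using the two cases of Example 3.4.
    if a < r:
        return r * a ** 3 / 6 - a ** 4 / 24
    return r ** 4 / 24 + r ** 2 * a ** 2 / 4 - r ** 3 * a / 6

def midpoint(func, lo, hi, n=4000):
    # Plain midpoint rule, used only to cross-check the closed forms.
    step = (hi - lo) / n
    return sum(func(lo + (k + 0.5) * step) for k in range(n)) * step

# Cross-checks against numerical integration.
kernel_numeric = midpoint(lambda u: (1.0 - u) * (2.0 - u), 0.0, 1.0)
short_numeric = midpoint(lambda s: r1(s, 1.0), 0.0, 0.5)
long_numeric = midpoint(lambda s: r1(s, 1.0), 0.0, 2.0)

# One basis function for hypothetical jump times of Z and tau_i = 1.0.
h_value = h_basis(1.0, [0.2, 0.5, 1.3], 0.4)
```

Only the jumps $\sigma_j < \tau_i$ contribute to `h_value`, so the jump at 1.3 is ignored in the usage line.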

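Proposition 3.5. If $\varphi$ is continuously differentiable and $g \in \Theta(D)^\circ$ we define $\eta_i$ for
$i = 1, \ldots, N_t$ as

$$\eta_i(r) = \int_0^{\tau_i-} R(\tau_i - u, r) \, dZ_u$$

and

$$f_g(r) = \int_0^t \int_u^t Y_s \varphi'\left( \int_0^{s-} g(s-v) \, dZ_v \right) R(s-u, r) \, ds \, dZ_u.$$

Then the gradient of $l_t$ in $g$ is

$$\nabla l_t(g) = f_g - \sum_{i=1}^{N_t} \frac{\varphi'\left( \int_0^{\tau_i-} g(\tau_i - u) \, dZ_u \right)}{\varphi\left( \int_0^{\tau_i-} g(\tau_i - u) \, dZ_u \right)} \eta_i.$$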

The explicit derivation of the gradient above has several interesting consequences. First, a
necessary condition for $g \in \Theta(D)^\circ$ to be a minimizer of the penalized minus-log-likelihood
function is that $g$ solves $\nabla l_t(g) + 2\lambda Pg = 0$, which yields an integral equation in $g$. The
integral equation is hardly solvable in any generality, but for $\varphi(x) = x + d$ it does provide
the same information as Theorem 3.1 for interior minimizers - that is, a minimizer must
belong to the given finite dimensional subspace of $W^{m,2}([0,t])$. The gradient can be used for
descent algorithms. Inspired by the gradient expression we propose a generic algorithm,
Algorithm 3.6, for subspace approximations. We consider here only the case where $D = \mathbb{R}$
so that $\Theta(D) = W^{m,2}([0,t])$. The objective function that we attempt to minimize with
Algorithm 3.6 is

$$\Lambda(g) = l_t(g) + \lambda \|Pg\|^2$$

with gradient $\nabla\Lambda(g) = \nabla l_t(g) + 2\lambda Pg$. We assume here that $\varphi$ is continuously
differentiable. To show a convergence result we need to introduce a condition on the steps of the
algorithm, and for this purpose we introduce for $0 < c_1 < c_2 < 1$ and $\delta \in (0,1)$ fixed and
$g \in W^{m,2}([0,t])$ the subset $W(g)$ of those $\tilde{g} \in W^{m,2}([0,t])$ that satisfy

$$\Lambda(\tilde{g}) \leq \Lambda(g) + c_1 \langle \nabla\Lambda(g), \tilde{g} - g \rangle,$$
$$\langle \nabla\Lambda(\tilde{g}), \tilde{g} - g \rangle \geq c_2 \langle \nabla\Lambda(g), \tilde{g} - g \rangle,$$
$$-\langle \nabla\Lambda(g), \tilde{g} - g \rangle \geq \delta \|\nabla\Lambda(g)\| \, \|\tilde{g} - g\|.$$

The two first conditions determining $W(g)$ above are known as the Wolfe conditions in
the literature on numerical optimization, Nocedal & Wright (2006). The third is an angle
condition, which is automatically fulfilled if $\tilde{g} - g = -\alpha \nabla\Lambda(g)$ for $\alpha > 0$. In Algorithm
3.6 we need to iteratively choose $\hat{g}_h$, and we show that if $\nabla\Lambda(\hat{g}_{h-1}) \neq 0$ then under the
assumptions in Theorem 3.7 below

$$W(\hat{g}_{h-1}) \cap \mathrm{span}\{\hat{g}_{h-1}, \nabla\Lambda(\hat{g}_{h-1})\} \neq \emptyset, \qquad (10)$$

which makes the iterative choices possible.


Algorithm 3.6. Initialize; fix $c_1, c_2$ with $0 < c_1 < c_2 < 1$ and $\delta \in (0,1)$, set

$$f_0(r) = \int_0^t \int_u^t Y_s R^1(s-u, r) \, ds \, dZ_u,$$

let $\hat{g}_0 \in \mathrm{span}\{\eta_1, \ldots, \eta_{N_t}, f_0\}$ and set $h = 1$.

1. Stop if $\nabla\Lambda(\hat{g}_{h-1}) = 0$. Otherwise choose

$$\hat{g}_h \in W(\hat{g}_{h-1}) \cap \mathrm{span}\{\eta_1, \ldots, \eta_{N_t}, f_0, \ldots, f_{h-1}\}$$

where $W(\hat{g}_{h-1})$ as defined above depends on $c_1$, $c_2$ and $\delta$.

2. Compute

$$f_h(r) = \int_0^t \int_u^t Y_s \varphi'\left( \int_0^{s-} \hat{g}_h(s-v) \, dZ_v \right) R^1(s-u, r) \, ds \, dZ_u.$$

3. Set $h = h + 1$ and return to 1.

Note that the computation of $f_h$ is just as in (9) except that the $Y$-process is
iteratively updated.
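
To make the role of the set $W(g)$ concrete, the sketch below runs a descent iteration on a finite-dimensional surrogate of $\Lambda$ and accepts a step only when the two Wolfe conditions and the angle condition hold. This is a simplification and not the algorithm above: the parameter is a vector in $\mathbb{R}^2$ rather than a function, the candidate is always taken along the negative gradient (so it lies in $\mathrm{span}\{\hat{g}_{h-1}, \nabla\Lambda(\hat{g}_{h-1})\}$), the step length is found by a simple bracketing search, and the objective with $\varphi = \exp$ and a ridge penalty is a hypothetical toy example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def wolfe_checks(obj, grad, g, g_new, c1, c2, delta):
    # The three conditions defining W(g), evaluated for the candidate g_new.
    step = [b - a for a, b in zip(g, g_new)]
    slope = dot(grad(g), step)
    decrease = obj(g_new) <= obj(g) + c1 * slope
    curvature = dot(grad(g_new), step) >= c2 * slope
    angle = -slope >= delta * norm(grad(g)) * norm(step)
    return decrease, curvature, angle

def descent(obj, grad, g0, c1=1e-4, c2=0.9, delta=1e-3, tol=1e-6, max_iter=500):
    g = list(g0)
    for _ in range(max_iter):
        d = grad(g)
        if norm(d) < tol:
            break
        lo, hi, alpha = 0.0, None, 1.0
        for _ in range(60):               # bracketing search for a step in W(g)
            cand = [gi - alpha * di for gi, di in zip(g, d)]
            decrease, curvature, angle = wolfe_checks(obj, grad, g, cand, c1, c2, delta)
            if decrease and curvature and angle:
                break
            if not decrease:
                hi = alpha                # step too long
            else:
                lo = alpha                # curvature condition failed: step too short
            alpha = 2.0 * alpha if hi is None else 0.5 * (lo + hi)
        g = cand                          # falls back to the last candidate if none accepted
    return g

# Hypothetical discretized objective: phi = exp, Y = 1, ridge penalty lam * |g|^2.
dt = 0.02
grid = [(1.0, (k + 0.5) * dt) for k in range(50)]
events = [(1.0, 0.25), (1.0, 0.75)]
lam = 0.5

def objective(g):
    return (sum(math.exp(dot(x, g)) for x in grid) * dt
            - sum(dot(x, g) for x in events) + lam * dot(g, g))

def objective_grad(g):
    out = [2.0 * lam * gk for gk in g]
    for x in grid:
        w = math.exp(dot(x, g)) * dt
        out = [o + w * xk for o, xk in zip(out, x)]
    return [o - sum(x[k] for x in events) for k, o in enumerate(out)]

g_hat = descent(objective, objective_grad, [0.0, 0.0])
```

In the function space setting each accepted iterate additionally requires the computation of a new function $f_h$, which this surrogate does not attempt to mimic.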



Theorem 3.7. If $D = \mathbb{R}$, if $\varphi$ is strictly positive, twice continuously differentiable and if
the sublevel set

$$S = \{ g \in \Theta(D) \mid \Lambda(g) \leq \Lambda(\hat{g}_0) \}$$

is bounded then Algorithm 3.6 is globally convergent in the sense that

$$\|\nabla\Lambda(\hat{g}_h)\| \to 0$$

for $h \to \infty$.

If we for instance have strict convexity of $\Lambda$ then under the assumptions in Theorem 3.7 we
have a unique minimizer in $S$. Then we can strengthen the conclusion about convergence
and get weak convergence of $\hat{g}_h$ towards the minimizer. In particular, we have the following
corollary.

Corollary 3.8. If there is a unique minimizer, $\hat{g}$, of $\Lambda$ in $S$ then under the assumptions
in Theorem 3.7

$$\hat{g}_h(s) \to \hat{g}(s)$$

for $h \to \infty$ for all $s \in [0,t]$.



4 Additive models 

We give in this section a brief treatment of how the setup in the previous section extends 
to the setup where the intensity is given in terms of a sum of linear filters. We restrict 
the discussion to the situation where V = 1^™'^([0, t])'^ and {Zs)o<s<t is a d-dimensional 
semi- martingale. Perceiving (7 G 1/ as a function 17 : [0, 1] — )• with coordinate functions 
in Ty™'2([0,t]) we write 

/ g{s - u)dZu = gj{s-u)dZj^u 
Jo Jo 

and just as above, by Corollary |A.2 



g i-> Xsg := / g{s - u)dZu 
Jo 



is a continuous linear function on V when equipped with the product topology. The in- 
ner product < g,h >= Yl'j=i < > with corresponding norm H^lP = X^f=il|5'i 



|2 



obviously turns V into a Hilbert space. 



The minus-log-likelihood function is given just as in the previous section, but we will consider the more general penalization term

$$J(g) = \lambda r(\|Pg_1\|^2, \ldots, \|Pg_d\|^2)$$

where $\lambda > 0$, $P$ is the orthogonal projection on $\operatorname{span}\{\varphi_1, \ldots, \varphi_m\}^\perp$ and $r : [0,\infty)^d \to [0,\infty)$ is coordinate-wise increasing. Theorem 3.1 easily generalizes with the following modification. If $\varphi(x) = x + d$ then with

$$h_{ij}(r) = \int_0^{\tau_i-} R^1(\tau_i - u, r)\,dZ_{j,u}$$

for $i = 1, \ldots, N_t$ and $j = 1, \ldots, d$ a minimizer of the penalized minus-log-likelihood function has $j$'th coordinate in the space spanned by $\varphi_1, \ldots, \varphi_m$ together with $h_{1j}, \ldots, h_{N_t j}$ and $f_j$ given by

$$f_j(r) = \int_0^t Y_s \int_0^{s-} R^1(s-u, r)\,dZ_{j,u}\,ds = \int_0^t \int_u^t Y_s R^1(s-u, r)\,ds\,dZ_{j,u}.$$



Theorem 3.5 also generalizes similarly and if $r$ is smooth, for instance if $r(x_1, \ldots, x_d) = \sum_{j=1}^d x_j$, Algorithm 3.6 generalizes as well.

In the alternative, we can choose $r(x_1, \ldots, x_d) = \sum_{j=1}^d \sqrt{x_j}$ leading to the penalty term

$$J(g) = \lambda \sum_{j=1}^d \|Pg_j\|$$

which gives an infinite dimensional version of grouped lasso. Since $r$ is not differentiable, Algorithm 3.6 does not work directly. However, a cyclical descent algorithm may be suggested, where we cyclically decide if the coordinate function $g_j$ should be equal to 0 or should be updated to decrement the objective function. The idea is then to initialize the algorithm with a large $\lambda$ and all $g_j$-functions equal to 0, and then in an outer loop decrease $\lambda$ in small steps and for each choice of $\lambda$ provide a warm start for the descent algorithm by using the previously estimated $g$. This strategy has been investigated thoroughly in Friedman et al. (2010) for the ordinary lasso and its generalizations showing very promising performance results.
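
The cyclical strategy with warm starts can be sketched in a finite-dimensional setting. The code below is an illustration only, assuming an ordinary least squares loss $\frac{1}{2}\|y - \sum_j X_j b_j\|^2 + \lambda \sum_j \|b_j\|$ in place of the minus-log-likelihood and assuming that the columns within each group are orthonormal, in which case the blockwise minimizer is the group soft-thresholding rule. The names `group_soft_threshold`, `cyclic_group_descent` and `lasso_path`, and the toy data, are hypothetical.

```python
import math


def group_soft_threshold(s, lam):
    """Blockwise minimizer: the zero vector if ||s|| <= lam, else s shrunk towards 0."""
    norm = math.sqrt(sum(v * v for v in s))
    if norm <= lam:
        return [0.0] * len(s)
    scale = 1.0 - lam / norm
    return [scale * v for v in s]


def cyclic_group_descent(cols, groups, y, lam, beta, sweeps=100):
    """Cycle over groups, deciding for each whether it is zero or updated.

    cols[k] is the k'th column of the design, groups is a list of lists of
    column indices, and columns within a group are assumed orthonormal.
    """
    n = len(y)
    for _ in range(sweeps):
        for grp in groups:
            # Partial residual with the current group left out.
            r = list(y)
            for k, col in enumerate(cols):
                if k not in grp:
                    for i in range(n):
                        r[i] -= col[i] * beta[k]
            s = [sum(cols[k][i] * r[i] for i in range(n)) for k in grp]
            for k, v in zip(grp, group_soft_threshold(s, lam)):
                beta[k] = v
    return beta


def lasso_path(cols, groups, y, lams):
    """Outer loop: decrease lambda and warm start from the previous estimate."""
    beta = [0.0] * len(cols)
    path = []
    for lam in sorted(lams, reverse=True):
        beta = cyclic_group_descent(cols, groups, y, lam, beta)
        path.append((lam, list(beta)))
    return path


# Toy data: four orthonormal columns split into two groups of two.
cols = [[1.0 if i == k else 0.0 for i in range(4)] for k in range(4)]
groups = [[0, 1], [2, 3]]
y = [3.0, 4.0, 0.3, 0.4]
path = lasso_path(cols, groups, y, [6.0, 1.0, 0.25])
```

For the largest value of $\lambda$ both groups are set to zero, and as $\lambda$ decreases the groups enter the model one at a time, which is the behaviour that makes the warm start effective.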



5 Discussion 



The problem that initially motivated the present work was the estimation of the linear filter functions entering in the specification of a non-linear Hawkes model with an intensity specified as

$$\varphi\left(\sum_{j=1}^d \int_0^{s-} g_j(s-u)\,N_j(du)\right)$$

where $N_j$ for $j = 1, \ldots, d$ are counting processes, Brémaud & Massoulié (1996). We have provided structural and algorithmic results for the penalized maximum-likelihood estimator of $g_j$ in a Sobolev space and we have shown that these results can be established in a generality where the stochastic integrals are with respect to any semi-martingale. The representations of basis functions and the gradient are useful for specific examples such as counting processes, but of little analytic value for general semi-martingales. In practice we can only expect to observe a general semi-martingale discretely and numerical approximations to the integral representations and thus the minus-log-likelihood function must be used. If the semi-martingale is coarsely observed it is unknown how reliable the resulting approximation of the penalized maximum-likelihood estimator is.
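
As an illustration of the kind of numerical approximation meant here, the sketch below replaces the stochastic integral $\int_0^{s-} g(s-u)\,dZ_u$ by a left-point Riemann-Stieltjes sum over the observation times. This is one possible discretization chosen for simplicity, not a scheme taken from the paper, and the function name `discretized_filter` and the toy inputs are hypothetical.

```python
def discretized_filter(g, times, z_values, s):
    """Left-point sum approximating int_0^{s-} g(s - u) dZ_u.

    times must be increasing and z_values[k] is the observation of Z at
    times[k]. Only increments observed strictly before s are used.
    """
    total = 0.0
    for k in range(len(times) - 1):
        if times[k + 1] >= s:
            break
        total += g(s - times[k]) * (z_values[k + 1] - z_values[k])
    return total


# Toy check with Z_u = u and g(x) = x, where the exact value at s = 1 is 1/2.
grid = [k / 1000 for k in range(1001)]
approx = discretized_filter(lambda x: x, grid, grid, 1.0)

# With g identically 1 the sum telescopes to the observed increment of Z.
z_obs = [0.0, 0.4, -0.1, 0.7]
increment = discretized_filter(lambda x: 1.0, [0.0, 0.5, 1.0, 1.5], z_obs, 2.0)
```

The error of such a sum depends on the spacing of the observation times relative to the variation of $g$ and $Z$, which is exactly the issue raised above for coarsely observed semi-martingales.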

For practical applications the R-package ppstat contains an implementation of finite-dimensional glppm's with a formula based model specification of additive models. Currently the implementation only supports a quadratic penalization term, but work is ongoing to support grouped lasso penalization as described above. The package is available from http://www.math.ku.dk/~richard/ppstat/.



Another point worth mentioning is the similarity between Algorithm 3.6 and the functional gradient descent algorithm from the boosting literature, Bühlmann & Hothorn (2007). As in the boosting algorithm the functional estimate is iteratively updated by an additive component, and in one incarnation of Algorithm 3.6 this component is a scalar multiple of the gradient. The main difference is that we propose to compute the gradient in the functional space, which utilizes the inner product in that space, whereas the functional gradient descent algorithm computes the gradient in an ordinary Euclidean space and subsequently computes an approximating functional component by a base procedure. Details are found in Bühlmann & Hothorn (2007).

A Proofs

The Sobolev space $W^{m,2}([0,t])$ has already been equipped with one inner product denoted $\langle \cdot, \cdot \rangle$ and the corresponding norm $\|\cdot\|$. An alternative useful inner product on $W^{m,2}([0,t])$ is

$$\langle f, g \rangle_m = \sum_{j=0}^m \int_0^t D^j f(s) D^j g(s)\,ds$$

and the corresponding norm is given by

$$\|f\|_{m,2}^2 = \langle f, f \rangle_m = \sum_{j=0}^m \int_0^t (D^j f(s))^2\,ds.$$

It is straightforward to show that $\|\cdot\|$ and $\|\cdot\|_{m,2}$ are equivalent norms. We will use whichever norm is most convenient in the proofs below. Note that the embedding $W^{m,2}([0,t]) \hookrightarrow W^{k,2}([0,t])$ for $m \ge k$ is continuous, which is straightforward using the norms $\|\cdot\|_{m,2}$ and $\|\cdot\|_{k,2}$. The continuity of the embedding holds even when $k = 0$ where $W^{0,2}([0,t]) = L_2([0,t])$, which is not a reproducing kernel Hilbert space.

We note that the characterizing property of a reproducing kernel Hilbert space is that the function evaluations are continuous linear functionals. If $\delta_s$ denotes the evaluation in $s$, that is, $\delta_s f = f(s)$, then $R(s, \cdot)$ as a function in $W^{m,2}([0,t])$ represents $\delta_s$ by

$$f(s) = \langle f, R(s, \cdot) \rangle.$$

By Cauchy-Schwarz' inequality $\|\delta_s\| = \sqrt{R(s,s)}$ and since $R$ is a continuous function of both variables $R(s,s)$ is bounded for $s$ in a compact set.

We have already argued that the stochastic integration of deterministic functions from $W^{m,2}([0,t])$ can be regarded as a pathwise, linear functional defined on $W^{m,2}([0,t])$ for $m \ge 1$. The next lemma and the following corollary state that this functional is continuous.

Lemma A.1. Let $0 \le s \le t$. Then the linear functional $X_s : W^{1,2}([0,t]) \to \mathbb{R}$ defined by

$$X_s h = \int_0^s h(u)\,dZ_u$$

is continuous. More precisely, we have the bound

$$\|X_s\| \le |Z_s|(1+s) + |Z_0| + \left(\int_0^s Z_{u-}^2\,du\right)^{1/2} < \infty.$$



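Proof: Note that for $h \in W^{1,2}([0,t])$ we have

$$\|h\|^2 = h(0)^2 + \|h'\|_2^2$$

and in particular

$$\|h' 1_{[0,s]}\|_2 \le \|h'\|_2 \le \|h\|.$$

Using the pathwise integration by parts representation of the stochastic integral and Cauchy-Schwarz' inequality

$$\begin{aligned}
|X_s h| &\le |h(s)Z_s| + |h(0)Z_0| + \int_0^s |Z_{u-}h'(u)|\,du \\
&\le |Z_s||h(s)| + |Z_0||h(0)| + \left(\int_0^s Z_{u-}^2\,du\right)^{1/2} \|h' 1_{[0,s]}\|_2 \\
&\le \left(|Z_s|\|\delta_s\| + |Z_0|\|\delta_0\| + \left(\int_0^s Z_{u-}^2\,du\right)^{1/2}\right)\|h\| \\
&\le \left(|Z_s|(1+s) + |Z_0| + \left(\int_0^s Z_{u-}^2\,du\right)^{1/2}\right)\|h\|,
\end{aligned}$$

which shows the desired bound. Here we have used that for $m = 1$ we have $R(s,s) = 1 + s$, so that $\|\delta_s\| = \sqrt{1+s} \le 1 + s$ and $\|\delta_0\| = 1$, and that $Z$ is càdlàg, hence bounded and hence in $L_2([0,s])$ for any $s$. □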
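
The identity $R(s,s) = 1 + s$ used in the proof can be checked numerically. The sketch below assumes the kernel $R(s,r) = 1 + \min(s,r)$, which is not stated explicitly in the text but follows from the norm $\|h\|^2 = h(0)^2 + \|h'\|_2^2$, and verifies the reproducing property $\langle f, R(s,\cdot)\rangle = f(s)$ for a hypothetical test function. The function names and the midpoint rule are illustrative choices.

```python
import math


def kernel(s, r):
    """Assumed reproducing kernel for m = 1: R(s, r) = 1 + min(s, r)."""
    return 1.0 + min(s, r)


def kernel_dr(s, r):
    """Weak derivative of r -> R(s, r): 1 for r < s and 0 otherwise."""
    return 1.0 if r < s else 0.0


def inner_product_with_kernel(f0, fprime, s, t, n=20000):
    """<f, R(s, .)> = f(0) R(s, 0) + int_0^t f'(r) d/dr R(s, r) dr (midpoint rule)."""
    h = t / n
    integral = sum(
        fprime((k + 0.5) * h) * kernel_dr(s, (k + 0.5) * h) for k in range(n)
    ) * h
    return f0 * kernel(s, 0.0) + integral


# Hypothetical test function f(r) = 2 + sin(r) on [0, 2], evaluated at s = 0.7.
s_point, t_end = 0.7, 2.0
reproduced = inner_product_with_kernel(2.0, math.cos, s_point, t_end)
exact = 2.0 + math.sin(s_point)
```

With this kernel the squared norm of the evaluation functional is $R(s,s) = 1 + s$, in agreement with the value used in the bound.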

As the embedding $W^{m,2}([0,t]) \hookrightarrow W^{1,2}([0,t])$ is continuous we get the following immediate corollary.

Corollary A.2. The linear functional $X_s : W^{m,2}([0,t]) \to \mathbb{R}$ defined by

$$X_s h = \int_0^s h(u)\,dZ_u$$

is continuous.



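Corollary A.2 shows that $X_s \in W^{m,2}([0,t])^*$ for $s \ge 0$. We now show that it is also norm-càdlàg.

Lemma A.3. The process $(X_s)_{0 \le s \le t}$ is a norm-càdlàg stochastic process.

Proof: For $\epsilon > 0$

$$|X_{s+\epsilon}h - X_s h| = \left|\int_{s+}^{s+\epsilon} h(u)\,dZ_u\right|.$$

Again by integration by parts

$$\int_{s+}^{s+\epsilon} h(u)\,dZ_u = h(s+\epsilon)Z_{s+\epsilon} - h(s)Z_s - \int_s^{s+\epsilon} Z_{u-}h'(u)\,du$$

and arguments similar to those in the proof of Lemma A.1 give that

$$|X_{s+\epsilon}h - X_s h| \le \left(\|Z_{s+\epsilon}\delta_{s+\epsilon} - Z_s\delta_s\| + \left(\int_s^{s+\epsilon} Z_{u-}^2\,du\right)^{1/2}\right)\|h\|.$$

This shows that

$$\|X_{s+\epsilon} - X_s\| \le \|Z_{s+\epsilon}\delta_{s+\epsilon} - Z_s\delta_s\| + \left(\int_s^{s+\epsilon} Z_{u-}^2\,du\right)^{1/2}$$

and letting $\epsilon \to 0+$ the right hand side tends to 0 by an application of dominated convergence and because $Z_{s+\epsilon} \to Z_s$ and $\delta_{s+\epsilon} \to \delta_s$. This proves that the process is continuous from the right in norm.

Defining $X_{s-}$ by

$$X_{s-}h = \int_0^{s-} h(u)\,dZ_u$$

a similar argument shows that $\|X_{s-\epsilon} - X_{s-}\| \to 0$ for $\epsilon \to 0+$, which shows that the process has limits from the left in norm. □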



To give the proof of Theorem 3.1 we will use the following general lemma.

Lemma A.4. If $(H_t)_{t \ge 0}$ is a norm-càdlàg stochastic process with values in $V^*$ then for $t > 0$ the integral $\int_0^t Y_s H_{s-}\,ds$ defined by

$$\beta \mapsto \int_0^t Y_s H_{s-}\beta\,ds \tag{11}$$

is in $V^*$ with

$$\left\|\int_0^t Y_s H_{s-}\,ds\right\| \le \int_0^t |Y_s|\,\|H_{s-}\|\,ds.$$

Proof: Clearly (11) defines for a fixed $t > 0$ a linear functional on $V$. Moreover, since

$$|H_{s-}\beta| \le \|H_{s-}\|\,\|\beta\|$$

we have

$$\left|\int_0^t Y_s H_{s-}\beta\,ds\right| \le \int_0^t |Y_s H_{s-}\beta|\,ds \le \left(\int_0^t |Y_s|\,\|H_{s-}\|\,ds\right)\|\beta\|.$$

Now as $(H_t)_{t \ge 0}$ is assumed norm-càdlàg it follows by continuity of the norm that $\|H_{s-}\|$ for $s \in [0,t]$ is bounded, and the integral is finite and a bound on the norm of the functional. □



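Proof: (Theorem 3.1) When $\varphi(x) = x + d$ we have that

$$\begin{aligned}
l_t(g) &= \int_0^t \left(Y_s \int_0^{s-} g(s-u)\,dZ_u + dY_s\right) ds - \int_0^t \log\left(Y_s \int_0^{s-} g(s-u)\,dZ_u + dY_s\right) N(ds) \\
&= \int_0^t Y_s \int_0^{s-} g(s-u)\,dZ_u\,ds + d\int_0^t Y_s\,ds - \sum_{i=1}^{N_t} \log\left(Y_{\tau_i} \int_0^{\tau_i-} g(\tau_i-u)\,dZ_u + dY_{\tau_i}\right).
\end{aligned}$$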



It follows from Corollary A.2 that

$$g \mapsto \int_0^{\tau_i-} g(\tau_i-u)\,dZ_u$$

are continuous, linear functionals on $W^{m,2}([0,t])$. The $i$'th of these continuous linear functionals is represented by $\eta_i \in W^{m,2}([0,t])$ given as

$$\eta_i(s) = \int_0^{\tau_i-} R(\tau_i-u, s)\,dZ_u$$

such that

$$\langle \eta_i, g \rangle = \int_0^{\tau_i-} g(\tau_i-u)\,dZ_u.$$

Hence $h_i = P\eta_i$.

Combining Lemma A.3 and A.4 we conclude that

$$g \mapsto \int_0^t Y_s \int_0^{s-} g(s-u)\,dZ_u\,ds$$

is a continuous linear functional, and if $\eta$ is the representer given by

$$\eta(r) = \int_0^t Y_s \int_0^{s-} R(s-u, r)\,dZ_u\,ds$$

then $f = P\eta$.

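Thus $l_t(g)$ is a function of a finite number of continuous, linear functionals on $W^{m,2}([0,t])$,

$$l_t(g) = \langle \eta, g \rangle - \sum_{i=1}^{N_t} \log\left(Y_{\tau_i}\langle \eta_i, g \rangle + dY_{\tau_i}\right) + K$$

where $K = d\int_0^t Y_s\,ds$ does not depend upon $g$. Assume that $g \in \Theta(D) \subset W^{m,2}([0,t])$ and write $g = g_0 + \rho$ where $g_0 \in \operatorname{span}\{\varphi_1, \ldots, \varphi_m, h_1, \ldots, h_{N_t}, f\}$ and $\rho \in \operatorname{span}\{\varphi_1, \ldots, \varphi_m, h_1, \ldots, h_{N_t}, f\}^\perp$. Then $\rho \perp \eta_i$ for $i = 1, \ldots, N_t$, $\rho \perp \eta$, $P\rho = \rho$ and

$$\begin{aligned}
l_t(g) + \lambda\|Pg\|^2 &= \langle \eta, g \rangle - \sum_{i=1}^{N_t} \log\left(Y_{\tau_i}\langle \eta_i, g \rangle + dY_{\tau_i}\right) + K + \lambda\|Pg\|^2 \\
&= \langle \eta, g_0 \rangle - \sum_{i=1}^{N_t} \log\left(Y_{\tau_i}\langle \eta_i, g_0 \rangle + dY_{\tau_i}\right) + K + \lambda\|Pg_0\|^2 + \lambda\|\rho\|^2 \\
&\ge l_t(g_0) + \lambda\|Pg_0\|^2
\end{aligned}$$

with equality if and only if $\rho = 0$. Thus a minimizer of $l_t(g) + \lambda\|Pg\|^2$ over $\Theta(D)$ must be in $\operatorname{span}\{\varphi_1, \ldots, \varphi_m, h_1, \ldots, h_{N_t}, f\}$. □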

We have used the Fubini theorem below to give an alternative representation of the basis function $f$ from Theorem 3.1. The result is a consequence of Theorem 45 in Protter (2005). With the pathwise definition of stochastic integrals, as given by (6), that we have used throughout, we can give an elementary proof.

Lemma A.5. With $(Z_s)_{0 \le s \le t}$ a semi-martingale and $(Y_s)_{0 \le s \le t}$ a predictable, càdlàg process then

$$\int_0^t Y_s \int_0^{s-} g(s-u)\,dZ_u\,ds = \int_0^t \int_u^t Y_s g(s-u)\,ds\,dZ_u.$$

Proof: Using (6) and Fubini

$$\begin{aligned}
\int_0^t Y_s \int_0^{s-} g(s-u)\,dZ_u\,ds &= g(0)\int_0^t Z_{s-}Y_s\,ds - Z_0\int_0^t g(s)Y_s\,ds + \int_0^t Y_s \int_0^s Z_{u-}g'(s-u)\,du\,ds \\
&= g(0)\int_0^t Z_{s-}Y_s\,ds - Z_0\int_0^t g(s)Y_s\,ds + \int_0^t Z_{u-}\int_u^t Y_s g'(s-u)\,ds\,du.
\end{aligned}$$



To use the pathwise integration by parts formula for the right hand side of the lemma we first need to verify that the integrand is sufficiently regular. Defining

$$G(u) = \int_u^t Y_s g(s-u)\,ds$$

for $g \in W^{1,2}([0,t])$ then $G$ is weakly differentiable with derivative

$$G'(u) = -\int_u^t Y_s g'(s-u)\,ds - Y_u g(0),$$

which is verified simply by checking that $G(u) = -\int_u^t G'(v)\,dv$. Using this, and that $G(t) = 0$, we get for the right hand side of the lemma that

$$\begin{aligned}
\int_0^t \int_u^t Y_s g(s-u)\,ds\,dZ_u &= G(t)Z_t - G(0)Z_0 - \int_0^t Z_{u-}G'(u)\,du \\
&= -G(0)Z_0 + \int_0^t Z_{u-}\left(\int_u^t Y_s g'(s-u)\,ds + Y_u g(0)\right)du \\
&= g(0)\int_0^t Z_{s-}Y_s\,ds - Z_0\int_0^t g(s)Y_s\,ds + \int_0^t Z_{u-}\int_u^t Y_s g'(s-u)\,ds\,du.
\end{aligned}$$

□
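
For a counting process the interchange of integrals in Lemma A.5 can be checked numerically, since the stochastic integral is then a finite sum over the jump times. The sketch below is only a sanity check under hypothetical choices: the jump times, the deterministic $Y$-process, the filter function $g$ and the midpoint rule are all assumptions made for illustration.

```python
import math


def midpoint(fun, a, b, n=20000):
    """Midpoint rule approximation of the integral of fun over [a, b]."""
    h = (b - a) / n
    return sum(fun(a + (k + 0.5) * h) for k in range(n)) * h


jumps = [0.3, 1.1, 1.6]  # hypothetical jump times of a counting process Z = N
t_end = 2.0


def Y(s):
    """Hypothetical deterministic Y-process."""
    return 1.0 + s


def g(x):
    """Hypothetical filter function."""
    return math.exp(-x)


# Left hand side: integrate Y_s * int_0^{s-} g(s - u) dN_u over s in [0, t].
lhs = midpoint(
    lambda s: Y(s) * sum(g(s - tau) for tau in jumps if tau < s), 0.0, t_end
)

# Right hand side: for each jump time u, integrate Y_s g(s - u) over s in [u, t].
rhs = sum(
    midpoint(lambda s, tau=tau: Y(s) * g(s - tau), tau, t_end) for tau in jumps
)
```

The two numbers agree up to the quadrature error, which on the left hand side is dominated by the discontinuities of the integrand at the jump times.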



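Proof: (Theorem 3.5) The Gâteaux derivative of $l_t$ in the direction of $h \in W^{m,2}([0,t])$ for $g \in \Theta(D)^\circ$ is by Proposition 2.3

$$\begin{aligned}
Dl_t(g)h &= \int_0^t Y_s \varphi'\left(\int_0^{s-} g(s-u)\,dZ_u\right) \int_0^{s-} h(s-u)\,dZ_u\,ds \\
&\quad - \int_0^t \frac{\varphi'\left(\int_0^{s-} g(s-u)\,dZ_u\right)}{\varphi\left(\int_0^{s-} g(s-u)\,dZ_u\right)} \int_0^{s-} h(s-u)\,dZ_u\,N(ds).
\end{aligned}$$

Now just as in the proof of Theorem 3.1 - replacing $Y_s$ by $Y_s\varphi'\left(\int_0^{s-} g(s-u)\,dZ_u\right)$ - it follows that the first term above is a continuous, linear functional on $W^{m,2}([0,t])$ with representer $f_g$. Moreover, with $\eta_i$ as defined in Theorem 3.5 the second term above is seen to be a continuous, linear functional on $W^{m,2}([0,t])$ with representer

$$\xi_g = \sum_{i=1}^{N_t} \frac{\varphi'\left(\int_0^{\tau_i-} g(\tau_i-u)\,dZ_u\right)}{\varphi\left(\int_0^{\tau_i-} g(\tau_i-u)\,dZ_u\right)}\,\eta_i.$$

In conclusion, the gradient of $l_t$ in $g$ is $\nabla l_t(g) = f_g - \xi_g$. □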

Lemma A.6. If $D = \mathbb{R}$ and $\varphi$ is strictly positive and twice continuously differentiable then the gradient $\nabla\Lambda : W^{m,2}([0,t]) \to W^{m,2}([0,t])$ is Lipschitz continuous on any bounded set.

Proof: Let $B(0, L)$ denote the ball with radius $L$ in $W^{m,2}([0,t])$. Corollary A.2 shows that $X_s$ is a continuous functional and $g \mapsto X_{s-}g = \int_0^{s-} g(s-u)\,dZ_u$ is likewise continuous. That is, $|X_{s-}g| \le \|X_{s-}\|\,\|g\|$ and $s \mapsto \|X_{s-}\|$ is, moreover, bounded on $[0,t]$. This means that there is an $M > 0$ such that $X_{s-}g \in [-M, M]$ for all $g \in B(0, L)$ and $s \in [0,t]$. Since $\varphi$ is twice continuously differentiable we have that $\varphi'$ is Lipschitz continuous on $[-M, M]$ with Lipschitz constant $K$, say. With $f_g$ for $g \in W^{m,2}([0,t])$ as in Theorem 3.5 we find that for $g, g' \in W^{m,2}([0,t])$

$$f_g - f_{g'} = \int_0^t Y_s\left(\varphi'(X_{s-}g) - \varphi'(X_{s-}g')\right) \int_0^{s-} R(s-u, \cdot)\,dZ_u\,ds$$

and as above, by the isometric isomorphism that identifies $W^{m,2}([0,t])$ with its dual, we get by Lemma A.4 that if also $g, g' \in B(0, L)$ then

$$\begin{aligned}
\|f_g - f_{g'}\| &\le \int_0^t |Y_s|\,|\varphi'(X_{s-}g) - \varphi'(X_{s-}g')|\,\|X_{s-}\|\,ds \\
&\le K\int_0^t |Y_s|\,\|X_{s-}\|^2\,\|g - g'\|\,ds \\
&= C_1\|g - g'\|, \qquad C_1 := K\int_0^t |Y_s|\,\|X_{s-}\|^2\,ds.
\end{aligned}$$

Since $\varphi$ is strictly positive - and twice continuously differentiable - $x \mapsto \varphi'(x)/\varphi(x)$ is Lipschitz continuous on $[-M, M]$ with Lipschitz constant $K'$, say. Then for $g, g' \in B(0, L)$ and with $\xi_g$ denoting the representer of the second term of the derivative in Theorem 3.5,

$$\begin{aligned}
\|\xi_g - \xi_{g'}\| &\le \sum_{i=1}^{N_t} \left|\frac{\varphi'(X_{\tau_i-}g)}{\varphi(X_{\tau_i-}g)} - \frac{\varphi'(X_{\tau_i-}g')}{\varphi(X_{\tau_i-}g')}\right|\,\|\eta_i\| \\
&\le K'\sum_{i=1}^{N_t} \|X_{\tau_i-}\|\,\|g - g'\|\,\|\eta_i\| \\
&= C_2\|g - g'\|, \qquad C_2 := K'\sum_{i=1}^{N_t} \|X_{\tau_i-}\|\,\|\eta_i\|.
\end{aligned}$$

By Theorem 3.5 we have shown that the gradient $\nabla l_t$ is Lipschitz continuous on the bounded set $B(0, L)$ with Lipschitz constant $C = C_1 + C_2$. Since $\nabla\Lambda = \nabla l_t + 2\lambda P$ and $2\lambda P$ is linear this proves that $\nabla\Lambda$ is Lipschitz continuous on bounded sets. □



Proof: (Theorem 3.7) We prove first by induction that it is possible to iteratively choose $g_h$ as prescribed in Algorithm 3.6. The induction start is given by assumption. Assume that $g_h$ is chosen as in Algorithm 3.6. Since $\Lambda : W^{m,2}([0,t]) \to \mathbb{R}$ is continuous and

$$S_h := \{g \in W^{m,2}([0,t]) \mid \Lambda(g) \le \Lambda(g_h)\} \subset S$$

is bounded by assumption we find that $\Lambda$ is bounded below along the ray $g_h - \alpha\nabla\Lambda(g_h)$ for $\alpha > 0$. If $\nabla\Lambda(g_h) \neq 0$ we can proceed exactly as in the proof of Lemma 3.1 in Nocedal & Wright (2006), and there exists $\alpha > 0$ such that

$$g_{h+1} = g_h - \alpha\nabla\Lambda(g_h) \in S_h$$

fulfills the two Wolfe conditions:

$$\Lambda(g_{h+1}) \le \Lambda(g_h) - c_1\alpha\|\nabla\Lambda(g_h)\|^2$$

$$\langle \nabla\Lambda(g_{h+1}), \nabla\Lambda(g_h) \rangle \le c_2\|\nabla\Lambda(g_h)\|^2.$$

Since $g_h \in \operatorname{span}\{h_1, \ldots, h_{N_t}, f_0, \ldots, f_{h-1}\}$ and $\nabla\Lambda(g_h) \in \operatorname{span}\{h_1, \ldots, h_{N_t}, f_0, \ldots, f_h\}$ and since $g_{h+1} - g_h = -\alpha\nabla\Lambda(g_h)$ we find that

$$g_{h+1} \in W(g_h) \cap \operatorname{span}\{h_1, \ldots, h_{N_t}, f_0, \ldots, f_h\}$$

and the set on the right hand side is in particular non-empty. This proves that it is possible to iteratively choose $g_h$ as in Algorithm 3.6.

For the entire sequence $(g_h)_{h \ge 0}$ we get from the second Wolfe condition together with the Cauchy-Schwarz inequality and Lipschitz continuity of $\nabla\Lambda$ on $S$ that

$$(c_2 - 1)\langle \nabla\Lambda(g_h), g_{h+1} - g_h \rangle \le \langle \nabla\Lambda(g_{h+1}) - \nabla\Lambda(g_h), g_{h+1} - g_h \rangle \le C\|g_{h+1} - g_h\|^2,$$

which implies that

$$\|g_{h+1} - g_h\| \ge \frac{c_2 - 1}{C}\,\frac{\langle \nabla\Lambda(g_h), g_{h+1} - g_h \rangle}{\|g_{h+1} - g_h\|}.$$

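Combining the angle condition with the first Wolfe condition gives that

$$\begin{aligned}
\Lambda(g_{h+1}) &\le \Lambda(g_h) + c_1\|g_{h+1} - g_h\|\,\frac{\langle \nabla\Lambda(g_h), g_{h+1} - g_h \rangle}{\|g_{h+1} - g_h\|} \\
&\le \Lambda(g_h) - \frac{c_1(1 - c_2)}{C}\,\frac{\langle \nabla\Lambda(g_h), g_{h+1} - g_h \rangle^2}{\|\nabla\Lambda(g_h)\|^2\,\|g_{h+1} - g_h\|^2}\,\|\nabla\Lambda(g_h)\|^2 \\
&\le \Lambda(g_h) - \frac{c_1(1 - c_2)\delta^2}{C}\,\|\nabla\Lambda(g_h)\|^2.
\end{aligned}$$

By induction

$$\Lambda(g_{h+1}) \le \Lambda(g_0) - \frac{c_1(1 - c_2)\delta^2}{C}\sum_{k=0}^h \|\nabla\Lambda(g_k)\|^2.$$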

To finish the proof we need to show that $\Lambda$ is bounded below on $S$, because then the inequality above implies that

$$\|\nabla\Lambda(g_h)\| \to 0$$

for $h \to \infty$. To show that $\Lambda$ is bounded below we observe that

$$l_t(g) \ge -\int_0^t \log\left(Y_s\varphi\left(\int_0^{s-} g(s-u)\,dZ_u\right)\right) N(ds) = -\sum_{i=1}^{N_t} \log\left(Y_{\tau_i}\varphi(\langle \eta_i, g \rangle)\right).$$

Since this lower bound as a function of $g$ is weakly continuous and since a bounded set is weakly compact by reflexivity of a Hilbert space and Banach-Alaoglu's Theorem we have proved that $\Lambda$ is bounded below on the bounded set $S$. □



For the proof of Corollary 3.8 we need the following lemma. 

Lemma A.7. If $\varphi$ is strictly positive and continuously differentiable the map $g \mapsto \nabla\Lambda(g)$ is weak-weak continuous.



Proof: By definition of the weak topology we need to show that

$$g \mapsto \langle \nabla\Lambda(g), h \rangle = \langle \nabla l_t(g), h \rangle + 2\lambda\langle Pg, h \rangle$$

is weakly continuous for all $h \in W^{m,2}([0,t])$. Clearly $g \mapsto \langle Pg, h \rangle = \langle g, Ph \rangle$ is weakly continuous so we can restrict our attention to $g \mapsto \langle \nabla l_t(g), h \rangle$. We will use Theorem 3.5, and to do so we observe that

$$g \mapsto \int_0^{s-} g(s-u)\,dZ_u$$

for fixed $s$ is weakly continuous by the definition of the weak topology and the fact that we have already shown the map above to be a continuous linear functional. We conclude directly from this that

$$g \mapsto \sum_{i=1}^{N_t} \frac{\varphi'\left(\int_0^{\tau_i-} g(\tau_i-u)\,dZ_u\right)}{\varphi\left(\int_0^{\tau_i-} g(\tau_i-u)\,dZ_u\right)}\,\langle \eta_i, h \rangle$$

is weakly continuous as $\varphi$ is assumed strictly positive and continuously differentiable. We finish the proof by showing that $g \mapsto \langle f_g, h \rangle$ is weakly continuous with $f_g$ as in Theorem 3.5. Let $g_n \to g$ weakly for $n \to \infty$ in which case

$$\int_0^{s-} g_n(s-u)\,dZ_u \to \int_0^{s-} g(s-u)\,dZ_u$$

for all $s \in [0,t]$. Since the stochastic integral as a function of $s$ is bounded on $[0,t]$ and $\varphi'$ is continuous, the pointwise convergence of

$$Y_s\varphi'\left(\int_0^{s-} g_n(s-u)\,dZ_u\right) X_s h \to Y_s\varphi'\left(\int_0^{s-} g(s-u)\,dZ_u\right) X_s h$$

for $s \in [0,t]$ is dominated by a constant, which is integrable over $[0,t]$. Hence

$$\langle f_{g_n}, h \rangle = \int_0^t Y_s\varphi'\left(\int_0^{s-} g_n(s-u)\,dZ_u\right) X_s h\,ds \to \int_0^t Y_s\varphi'\left(\int_0^{s-} g(s-u)\,dZ_u\right) X_s h\,ds = \langle f_g, h \rangle.$$

□



Proof: (Corollary 3.8) By assumption, $g$ is the unique solution to $\nabla\Lambda(g) = 0$. The bounded set $S$ is weakly compact as argued above and the weak topology is, moreover, metrizable on $S$ since $W^{m,2}([0,t])$ is separable. Therefore any subsequence of $(g_h)_{h \ge 0}$ has a subsequence that converges weakly in $S$, necessarily towards a limit with vanishing gradient by Lemma A.7. Uniqueness of $g$ implies that $(g_h)_{h \ge 0}$ itself is weakly convergent with limit $g$. The proof is completed by noting that weak convergence in a reproducing kernel Hilbert space implies pointwise convergence. □